In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
The issues associated with validation and cross-validation are some of the most important aspects of the practice of machine learning. Selecting the optimal model for your data is vital, and is a piece of the problem that is not often appreciated by machine learning practitioners.
Of core importance is the following question:
If our estimator is underperforming, how should we move forward?
The answer is often counter-intuitive. In particular, sometimes using a more complicated model will give worse results. Also, sometimes adding training data will not improve your results. The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.
One way to address this issue is to use what are often called Learning Curves.
Given a particular dataset and a model we'd like to fit (here, a support vector regression with different kernels), we'd like to tune the value of the hyperparameter kernel to give us the best fit. We can visualize the different regimes with the following plot, modified from the scikit-learn documentation examples. (After the plot, we also sketch how this choice can be automated with a cross-validated grid search.)
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
rng = np.random.RandomState(42)
n_samples = 200
kernels = ['linear', 'poly', 'rbf']
true_fun = lambda X: X ** 3
X = np.sort(5 * (rng.rand(n_samples) - .5))
y = true_fun(X) + .01 * rng.randn(n_samples)
plt.figure(figsize=(14, 5))
for i in range(len(kernels)):
    ax = plt.subplot(1, len(kernels), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    model = SVR(kernel=kernels[i], C=5)
    model.fit(X[:, np.newaxis], y)

    # Evaluate the model using cross-validation
    # (scores are negated MSE, since scikit-learn always maximizes scores)
    scores = cross_val_score(model, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(-1.5, 1.5, 100)
    plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((-1.5, 1.5))
    plt.ylim((-1, 1))
    plt.legend(loc="best")
    plt.title("Kernel {}\nMSE = {:.2e} (+/- {:.2e})".format(
        kernels[i], -scores.mean(), scores.std()))
plt.show()
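Rather than judging these plots by eye, we can also let cross-validation choose the hyperparameters for us. Below is a minimal sketch using scikit-learn's GridSearchCV; the particular parameter grid is an illustrative assumption, not part of the original example.
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Candidate hyperparameter values -- an illustrative grid, not from the original example
param_grid = {"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 5, 10]}

# Evaluate every combination with 10-fold cross-validation,
# scoring by negated mean squared error (scikit-learn maximizes scores)
grid = GridSearchCV(SVR(), param_grid,
                    scoring="neg_mean_squared_error", cv=10)
grid.fit(X[:, np.newaxis], y)

print("best parameters:", grid.best_params_)
print("best CV MSE: {:.2e}".format(-grid.best_score_))
Grid search simply automates the comparison we just did by hand; learning curves, which we build up next, tell us what to do when even the best of these models is not good enough.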
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)
n_samples = 200
true_fun = lambda X: X ** 3
X = np.sort(5 * (rng.rand(n_samples) - .5))
y = true_fun(X) + .01 * rng.randn(n_samples)
X = X[:, None]
f, axarr = plt.subplots(1, 3)
axarr[0].scatter(X[::20], y[::20])
axarr[0].set_xlim((-3 * .5, 3 * .5))
axarr[0].set_ylim((-1, 1))
axarr[1].scatter(X[::10], y[::10])
axarr[1].set_xlim((-3 * .5, 3 * .5))
axarr[1].set_ylim((-1, 1))
axarr[2].scatter(X, y)
axarr[2].set_xlim((-3 * .5, 3 * .5))
axarr[2].set_ylim((-1, 1))
plt.show()
The three scatter plots show the same dataset, subsampled to 10, 20, and all 200 points. They all come from the same underlying process. But if you were asked to make a prediction, you would be more likely to draw a straight line for the left-most one: there are only very few data points, and no real rule is apparent. For the dataset in the middle, some structure is recognizable, though the exact shape of the true function is perhaps not obvious. With even more data on the right-hand side, you would probably be very comfortable drawing a curved line with a lot of certainty.
A great way to explore how a model fit evolves with different dataset sizes is to use learning curves. A learning curve plots the training and validation error for a given model against different training set sizes. We will sketch the idea by hand after the questions below, and then use scikit-learn's built-in helper.
But first, take a moment to think about what we're going to see:
Questions:
- As the number of training samples increases, what do you expect to see for the training error? For the validation error?
- Would you expect the training error to be higher or lower than the validation error? Would you ever expect this to change as the training set grows?
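To make the idea concrete, here is a minimal hand-rolled sketch of a learning curve: hold out a fixed validation set, train on progressively larger subsets of the remaining data, and record both errors. The split sizes and subset sizes below are illustrative assumptions, not part of the original notebook.
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Hold out a fixed validation set; the 150/50 split is an illustrative assumption
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=50, random_state=0)

sizes = [60, 90, 120, 150]  # training subset sizes (illustrative)
train_errors, val_errors = [], []
for n in sizes:
    model = SVR(kernel='linear').fit(X_train[:n], y_train[:n])
    # error on the data the model was actually trained on
    train_errors.append(mean_squared_error(y_train[:n], model.predict(X_train[:n])))
    # error on the held-out validation set
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(sizes, train_errors, label="training error")
plt.plot(sizes, val_errors, label="validation error")
plt.xlabel("training set size")
plt.ylabel("MSE")
plt.legend(loc="best")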
scikit-learn's learning_curve helper automates this, performing the same computation with cross-validation instead of a single split. We can run the following code to plot the learning curve for a kernel = linear model:
In [ ]:
from sklearn.model_selection import learning_curve
from sklearn.svm import SVR

# learning_curve returns negated MSE, since scikit-learn always maximizes scores
training_sizes, train_scores, test_scores = learning_curve(
    SVR(kernel='linear'), X, y, cv=10,
    scoring="neg_mean_squared_error",
    train_sizes=[.6, .7, .8, .9, 1.])

# Negate the scores to recover the mean squared error
print(-train_scores.mean(axis=1))
plt.plot(training_sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(training_sizes, -test_scores.mean(axis=1), label="validation error")
plt.legend(loc='best')
You can see that for the model with kernel = linear, the validation error doesn't really improve as more data is given.
In general, the validation error improves with a growing training set, while the training error worsens with a growing training set. From this we can infer that, as the training size increases, the two curves will converge to a single value.
From the above discussion, we know that kernel = linear
underfits the data. This is indicated by the fact that both the
training and validation errors are very poor. When confronted with this type of learning curve,
we can expect that adding more training data will not help matters: both
lines will converge to a relatively high error.
When the learning curves have converged to a poor error, we have an underfitting model.
An underfitting model can be improved by:
- using a more sophisticated model (in this case, increasing the complexity of the kernel parameter).

An underfitting model cannot be improved, however, by increasing the number of training samples (do you see why?)
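To see this on our toy data, we can compare the cross-validated error of the linear kernel trained on half of the samples with the error on all of them, and against the more complex rbf kernel. This quick check is an added illustration (reusing C=5 from the figure above); the exact numbers will vary, but the pattern should match the learning curve.
In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def cv_mse(model, X, y):
    # Mean cross-validated MSE (scores come back as negated MSE, so flip the sign)
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    return -scores.mean()

# Doubling the training data barely helps the underfitting linear kernel ...
print("linear, 100 samples: {:.2e}".format(cv_mse(SVR(kernel='linear', C=5), X[::2], y[::2])))
print("linear, 200 samples: {:.2e}".format(cv_mse(SVR(kernel='linear', C=5), X, y)))

# ... while a more complex kernel on the same data reduces the error substantially
print("rbf,    200 samples: {:.2e}".format(cv_mse(SVR(kernel='rbf', C=5), X, y)))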
Now let's look at an overfit model:
In [ ]:
from sklearn.model_selection import learning_curve
from sklearn.svm import SVR

training_sizes, train_scores, test_scores = learning_curve(
    SVR(kernel='rbf'), X, y, cv=10,
    scoring="neg_mean_squared_error",
    train_sizes=[.6, .7, .8, .9, 1.])

# Negate the scores so that we plot the mean squared error itself
plt.plot(training_sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(training_sizes, -test_scores.mean(axis=1), label="validation error")
plt.legend(loc='best')
Here we show the learning curve for kernel = rbf
. From the above
discussion, we know that kernel = rbf
is an estimator
which mildly overfits the data. This is indicated by the fact that the
training error is much better than the validation error. As
we add more samples to this training set, the training error will
continue to worsen, while the cross-validation error will continue
to improve, until they meet in the middle. We can infer that adding more
data will allow the estimator to very closely match the best
possible cross-validation error.
When the learning curves have not yet converged with our full training set, it indicates an overfit model.
An overfitting model can be improved by:
- using a less-sophisticated model (in this case, making the kernel less complex, e.g. with kernel = poly);
- increasing the regularization (decreasing C for SVM/SVR);
- gathering more training samples.

In particular, gathering more features for each sample will not help the results.
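As a quick illustration of the regularization option, we can compare the cross-validated error of the rbf model for a few values of C (smaller C means stronger regularization for SVR). The particular values below are an assumption for illustration; on this nearly noise-free toy data the effect may be small.
In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Smaller C means stronger regularization for SVR
for C in [0.1, 1, 5, 50]:
    scores = cross_val_score(SVR(kernel='rbf', C=C), X, y,
                             scoring="neg_mean_squared_error", cv=10)
    print("C = {:5.1f}  CV MSE = {:.2e} (+/- {:.2e})".format(C, -scores.mean(), scores.std()))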
We’ve seen above that an under-performing algorithm can be due to two possible situations: underfitting and overfitting. Using the technique of learning curves, we can train on progressively larger subsets of the data, evaluating the training error and cross-validation error to determine whether our algorithm is overfitting or underfitting. But what do we do with this information?
If our algorithm is underfitting, the following actions might help:
- Use a more sophisticated model (for SVR, this means increasing the complexity of the kernel: linear << poly << rbf). Each learning technique has its own methods of adding complexity.

If our algorithm shows signs of overfitting, the following actions might help:
- Use a less-sophisticated model (decrease the complexity of the kernel).
- Gather more training samples.
- Increase the regularization (for SVM/SVR, decrease C).
These choices become very important in real-world situations, as data collection usually costs time and energy. If the model is underfitting, then spending weeks or months collecting more data could be a colossal waste of time! However, more data (usually) gives us a better view of the true nature of the problem, so these issues should always be carefully considered before going on a "data foraging expedition".
In [ ]: